26 | 3,095,693,983 | 8 | 28,375 | 40 | 50,810,670 | 90,504,854 | 56.14% | 56.14% | 237,019,315 | 7.66% | 119,819 | 20
26 | 3,095,693,983 | 16 | 60,752 | 5 | 2,763,911 | 90,504,854 | 3.05% | 3.05% | 237,019,315 | 7.66% | 119,819 | 10
26 | 3,095,693,983 | 16 | 59,864 | 10 | 41,399,291 | 90,504,854 | 45.74% | 45.74% | 237,019,315 | 7.66% | 119,819 | 10
26 | 3,095,693,983 | 16 | 58,594 | 20 | 59,688,346 | 90,504,854 | 65.95% | 65.95% | 237,019,315 | 7.66% | 119,819 | 10
26 | 3,095,693,983 | 16 | 57,649 | 40 | 60,828,479 | 90,504,854 | 67.21% | 67.21% | 237,019,315 | 7.66% | 119,819 | 10
26 | 3,095,693,983 | 32 | 122,405 | 5 | 5,248,209 | 90,504,854 | 5.80% | 5.80% | 237,019,315 | 7.66% | 119,819 | 0
26 | 3,095,693,983 | 32 | 121,461 | 10 | 66,220,523 | 90,504,854 | 73.17% | 73.17% | 237,019,315 | 7.66% | 119,819 | 0
26 | 3,095,693,983 | 32 | 119,670 | 20 | 80,920,976 | 90,504,854 | 89.41% | 89.41% | 237,019,315 | 7.66% | 119,819 | 0
26 | 3,095,693,983 | 32 | 117,139 | 40 | 86,787,800 | 90,504,854 | 95.89% | 95.89% | 237,019,315 | 7.66% | 119,819 | 0
Table S2

Model organism | ID | Genome size (bp) | # of chromosomes | Ploidy | Kingdom (Domain) | Phylum | Class | Order | Family | Genus | Species | Strain/version
M.jannaschii | 1 | 1,664,970 | | 1 | Archaea | Euryarchaeota | Methanococci | Methanococcales | Methanocaldococcaceae | Methanocaldococcus | jannaschii |
C.hydrogenoformans | 2 | 2,401,520 | | | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Carboxydothermus | hydrogenoformans |
E.coli | 3 | 4,639,675 | | | Eubacteria | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia | coli |
Y.pestis | 4 | 4,653,728 | | 1 | Eubacteria | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Yersinia | pestis | CO92 uid34
B.anthracis | 5 | 5,227,293 | | | Bacteria | Firmicutes | Bacilli | Bacillales | Bacillaceae | Bacillus | anthracis | Ames uid309
A.mirum | 6 | 8,248,144 | | | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinosynnemataceae | Actinosynnema | mirum |
yeast | 7 | 12,157,105 | 16 | 1 | Fungi | Ascomycota | Saccharomycetes | Saccharomycetales | Saccharomycetaceae | Saccharomyces | cerevisiae | uid128
Y.lipolytica | 8 | 20,502,981 | 6 | 1 | Fungi | Ascomycota | Saccharomycetes | Saccharomycetales | Dipodascaceae | Yarrowia | lipolytica | CLIB122_uid12414
slime mold | 9 | 34,338,145 | 6 | 1 | Amoebozoa | Mycetozoa | Dictyostelia | Dictyosteliida | Dictyosteliidae | Dictyostelium | discoideum | May/8/13 (downloaded)
red bread mold | 10 | 41,037,538 | 7 | 1 | Fungi | Ascomycota | Pezizomycotina | Sordariomycetes | Sordariales | Neurospora | crassa | OR74A, v1.0
sea squirt | 11 | 78,296,155 | 14 | 2 | Animalia | Chordata | Ascidiacea | Enterogona | Cionidae | Ciona | intestinalis | 2.0
roundworm | 12 | 100,272,276 | 6 | 2 | Animalia | Nematoda | Chromadorea | Rhabditida | Rhabditidae | Caenorhabditis | elegans | 10
green alga | 13 | 112,305,447 | 17 | 1 | Plantae | Chlorophyta | Chlorophyceae | Chlamydomonadales | Chlamydomonadaceae | Chlamydomonas | reinhardtii | 4.0
arabidopsis | 14 | 119,667,750 | 5+C+Mt | 2 | Plantae | Angiosperms | Magnoliopsida | Brassicales | Brassicaceae | Arabidopsis | thaliana | 9
fruitfly | 15 | 130,450,100 | 3+XU+Mt | 2 | Animalia | Arthropoda | Insecta | Diptera | Drosophilidae | Drosophila | melanogaster | 3.0
peach | 16 | 227,252,106 | 8 | 2 | Plantae | Angiosperms | Magnoliopsida | Rosales | Rosaceae | Prunus | persica | 139
rice | 17 | 370,792,118 | 12 | 2 | Plantae | Angiosperms | Monocots | Poales | Poaceae | Oryza | sativa | 2
poplar | 18 | 417,640,243 | 19 | 2 | Plantae | Angiosperms | Eudicots | Malpighiales | Salicaceae | Populus | trichocarpa | 3.0
tomato | 19 | 781,666,411 | 12 | 2 | Plantae | Magnoliophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | lycopersicum | 2.40
soybean | 20 | 973,344,380 | 20 | 2 | Plantae | Angiosperms | Eudicots | Fabales | Fabaceae | Glycine | max | 109
turkey | 21 | 1,061,998,909 | 30+WZ+Mt | 2 | Animalia | Chordata | Aves | Galliformes | Phasianidae | Meleagris | gallopavo | UMD2.70
zebrafish | 22 | 1,412,464,843 | 25+MT | 2 | Animalia | Chordata | Actinopterygii | Cypriniformes | Cyprinidae | Danio | rerio | ZV9.71
lizard | 23 | 1,799,126,364 | 6+abcdfgh | 2 | Animalia | Chordata | Reptilia | Squamata | Polychrotidae | Anolis | carolinensis | AnoCar2.0
corn | 24 | 2,066,432,718 | 10+Mt+Pt | 2 | Plantae | Angiosperms | Commelinids | Poales | Poaceae | Zea | mays | ZmB73
mouse | 25 | 2,654,895,218 | 19+XY | 2 | Animalia | Chordata | Mammalia | Rodentia | Muridae | Mus | musculus | mm9
human | 26 | 3,095,693,983 | 22+XY+Mt | 2 | Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | sapiens | hg19



Reads set | min (bp) | mean (bp) | max (bp)
mean1 | 3,336 | 3,642 | 3,849
mean2 | 6,690 | 7,392 | 7,709
mean4 | 13,767 | 14,948 | 15,710
mean8 | 28,142 | 30,160 | 31,640
mean16 (only for HG19) | 57,649 | 59,214 | 60,752
mean32 (only for HG19) | 117,139 | 120,168 | 122,405

Table S3: Statistics for reads set. This shows the minimum, mean, and maximum of the mean read lengths
used for each simulation across all genomes. There is small variability present depending on the exact set of
reads used in each simulation.



Number of repeats longer than the mean read length:

Mean read length (bp) | A. thaliana (120 Mbp) | Fruit fly (130 Mbp)
3,650 | 170 | 2,956
7,400 | 78 | 84
15,000 | 28 | 6
30,000 | 8 | 2

Table S4: Repeats longer than the mean read length in A. thaliana and D. melanogaster. This table
shows the number of repeats longer than the mean read length in the two genomes, as computed by aligning
the genome sequence to itself using MUMmer 3.23. Only exact matches are counted. Note that D. melanogaster
has more repeats than A. thaliana longer than 3,650 bp and 7,400 bp, but fewer longer than 15,000 bp.
Consequently, the relative assembly performance of the two genomes flips between the shorter and the
longer mean read lengths.
Organism | Genome size | Chemistry | Mean read length | N50 | Data source
E. coli K12 | 4.64 Mbp | C2 | 3,247 bp | 4.64 Mbp | Koren et al. (2013) Genome Biology. 14:R101
S. enterica Newport | 5.01 Mbp | C2 | 3,247 bp | 4.91 Mbp | Koren et al. (2013) Genome Biology. 14:R101
E. coli O157:H7 | 5.52 Mbp | C2 | 3,247 bp | 4.32 Mbp | Koren et al. (2013) Genome Biology. 14:R101
S. cerevisiae | 12.16 Mbp | C3 | 5,910 bp | 811 Kbp | Current paper
A. thaliana | 124.6 Mbp | C2 | 4,137 bp | 8.4 Mbp | * http://blog.pacificbiosciences.com/2013/08/new-data-release-arabidopsis-assembly.html
D. melanogaster | 130.4 Mbp | C3 | 10,040 bp | 15.30 Mbp | http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html
H. sapiens | 3.10 Gbp | C3 | 7,680 bp | 4.38 Mbp | http://blog.pacificbiosciences.com/2014/02/data-release-54x-long-read-coverage-for.html

Table S5: Assembly statistics using genuine reads

* Data were downloaded from this source, but we developed a new assembly, described in the main manuscript.



Table S6

Model organism | ID | Genus | Species | Genome size (bp) | Longest repeat (bp) | Journal | Year | Authors | Sequencing technology | Coverage | Other technology
M.jannaschii | 1 | Methanocaldococcus | jannaschii | 1,664,970 | 1,018 | Science | 1996 | C. Bult et al. | Sanger | N/A | NA
C.hydrogenoformans | 2 | Carboxydothermus | hydrogenoformans | 2,401,520 | 3,098 | PLOS Genetics | 2005 | Wu et al. | Sanger | N/A | NA
E.coli | 3 | Escherichia | coli | 4,639,675 | | | | | | |
Y.pestis | 4 | Yersinia | pestis | 4,653,728 | | Nature | 2001 | Parkhill et al. | Shotgun - Sanger | | NA
B.anthracis | 5 | Bacillus | anthracis | 5,227,293 | 4,651 | Journal of Bacteriology | 2009 | Ravel et al. | Shotgun - Sanger | 13 | NA
A.mirum | 6 | Actinosynnema | mirum | 8,248,144 | | | | Land et al. | | 28.9 | NA
yeast | 7 | Saccharomyces | cerevisiae | 12,157,105 | | | | Goffeau et al. | | N/A | NA
Y.lipolytica | 8 | Yarrowia | lipolytica | 20,502,981 | 9,462 | Nature | 2004 | Dujon et al. | Shotgun - Sanger | 10 | BAC
slime mold | 9 | Dictyostelium | discoideum | 34,338,145 | 75,813 | Nature | 2005 | Eichinger et al. | whole-chromosome shotgun (WCS) - N/A | low | HAPPY maps/YAC
red bread mold | 10 | Neurospora | crassa | 41,037,538 | 2,787 | Nature | 2003 | Galagan et al. | whole-genome shotgun (WGS) - Sanger | > 20 | BAC
sea squirt | 11 | Ciona | intestinalis | 78,296,155 | 1,555 | Science | 2002 | Dehal et al. | whole-genome shotgun (WGS) - Sanger | 10 | BAC
roundworm | 12 | Caenorhabditis | elegans | 100,272,276 | 38,987 | Science | 1998 | The C. elegans Sequencing Consortium | whole-genome shotgun (WGS) - Sanger | | YAC
green alga | 13 | Chlamydomonas | reinhardtii | 112,305,447 | 8,892 | Science | 2007 | Merchant et al. | whole-genome shotgun (WGS) - N/A | 13.00 | NA
arabidopsis | 14 | Arabidopsis | thaliana | 119,667,750 | 44,000 | Science | 2000 | varies per chromosome | | 10-15 | YAC/BAC/TAC
fruitfly | 15 | Drosophila | melanogaster | 130,450,100 | 30,892 | Science | 2000 | Adams et al. | whole-genome shotgun (WGS), Sanger/Illumina | >1.5 | BAC
peach | 16 | Prunus | persica | 227,252,106 | 6,416 | Nature | 2013 | The International Peach Genome Initiative | Sanger whole-genome shotgun, Sanger/Illumina | 0.5-1.5 | BAC
rice | 17 | Oryza | sativa | 370,792,118 | 56,270 | Science | 2002 | Goff et al. | whole-genome shotgun (WGS)/Illumina | 18.50 | BAC
poplar | 18 | Populus | trichocarpa | 417,640,243 | 5,005 | Science | 2006 | Tuskan et al. | whole-genome shotgun (WGS) | 7.56 | genetic mapping enabled chromosome-scale reconstruction
tomato | 19 | Solanum | lycopersicum | 781,666,411 | 16,733 | Nature | 2012 | The Tomato Genome Consortium | Roche/454 Titanium shotgun, SOLiD, Illumina, Sanger paired-end reads | varies | BAC
soybean | 20 | Glycine | max | 973,344,380 | 5,649 | Nature | 2010 | Schmutz et al. | whole-genome shotgun (WGS) | 6.50 | physical and high-density genetic maps
turkey | 21 | Meleagris | gallopavo | 1,061,998,909 | 4,652 | PLOS Biology | 2010 | Dalloul et al. | whole-genome shotgun (WGS), Roche 454 and Illumina GAII data | 5x/25x | BAC
zebrafish | 22 | Danio | rerio | 1,412,464,843 | 71,314 | Genome Research | 2001 | Broughton et al. | PCR/Sanger | sufficient | BAC
lizard | 23 | Anolis | carolinensis | 1,799,126,364 | 6,218 | Nature | 2011 | Alföldi et al. | whole-genome shotgun (WGS) | 6.00 | BAC
corn | 24 | Zea | mays | 2,066,432,718 | 66,069 | Science | 2009 | Schnable et al. | whole-genome shotgun (WGS), Sanger/Illumina | N/A | BAC
mouse | 25 | Mus | musculus | 2,654,895,218 | 152,358 | Nature | 2002 | Wade et al. | whole-genome shotgun (WGS) - Sanger | N/A | BAC
human | 26 | Homo | sapiens | 3,095,693,983 | 119,819 | Nature | 2001 | International Human Genome Sequencing Consortium | Hierarchical shotgun - Sanger | 8-10 | BAC




Figure 1: Lander-Waterman Statistics Simulation. Human genome assembly performance based on
Lander-Waterman statistics is shown. Note that the mean contig size increases continuously with deeper
coverage, even beyond the genome size itself. Also note that read length has only a linear effect, while
performance is dominated by coverage.
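
The behavior described in this caption can be reproduced with the classical Lander-Waterman expression for the expected contig length, L[(e^{c*sigma} - 1)/c + (1 - sigma)], where L is the read length, c the coverage, and sigma = 1 - T/L for a minimum detectable overlap T. The sketch below is only an illustration of that formula; the function name and the default of a zero minimum overlap are assumptions, not taken from the simulation code used for the figure.

```python
import math


def expected_contig_length(read_len, coverage, min_overlap=0):
    """Expected contig length (bp) under the Lander-Waterman model.

    read_len    -- read length L in bp
    coverage    -- sequencing depth c (must be positive)
    min_overlap -- minimum detectable overlap T in bp (assumed 0 here)
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    sigma = 1.0 - min_overlap / read_len
    # L * [ (e^{c*sigma} - 1) / c + (1 - sigma) ]
    return read_len * ((math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma))
```

With a zero minimum overlap the expression reduces to L(e^c - 1)/c, which is linear in the read length and exponential in the coverage. Because the model ignores repeats and the finite size of the genome, the predicted contig length keeps growing past the genome size.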




Figure 2: Genome assembly performance of 26 species 



For each genome listed below, the figures plot the assembly performance as a function of coverage and as a 
function of read length. 

S2.1 M. jannaschii (Euryarchaeota)
S2.2 C. hydrogenoformans (Firmicutes)
S2.3 E. coli (Eubacteria)
S2.4 Y. pestis (Proteobacteria)
S2.5 B. anthracis (Firmicutes)
S2.6 A. mirum (Actinobacteria)
S2.7 S. cerevisiae (Yeast)
S2.8 Y. lipolytica (Fungus)
S2.9 D. discoideum (Slime mold)
S2.10 N. crassa (Red bread mold)
S2.11 C. intestinalis (Sea squirt)
S2.12 C. elegans (Roundworm)
S2.13 C. reinhardtii (Green algae)
S2.14 A. thaliana (Arabidopsis)
S2.15 D. melanogaster (Fruitfly)
S2.16 P. persica (Peach)
S2.17 O. sativa (Rice)
S2.18 P. trichocarpa (Poplar)
S2.19 S. lycopersicum (Tomato)
S2.20 G. max (Soybean)
S2.21 M. gallopavo (Turkey)
S2.22 D. rerio (Zebrafish)
S2.23 A. carolinensis (Lizard)
S2.24 Z. mays (Corn)
S2.25 M. musculus (Mouse)
S2.26 H. sapiens (Human)

[Figures S2.1 to S2.26: one figure per genome listed above, each with two panels, "Assembly by Coverage" (x-axis: Coverage) and "Assembly by Read Length".]

Figure 3: Repeats in rice genome. The number of repeats in the rice genome (370 Mbp) is shown. MUMmer 3.23
is used to compute exact matches with the options "-maxmatch -n -l 100 -b -c". The number of 100 bp repeats
is over 41K and the longest one is around 44 kbp, which clearly shows that the rice genome is not a random
sequence.
Figure 4: Repeats in random sequence. We simulated a random genome that is identical to rice in terms of
the number of chromosomes and their sizes. The probability that two given positions share an exact 100 bp
match is $1/4^{100} \approx 1/(1.6 \times 10^{60})$, far too small for such a repeat to be expected in a
genome the size of rice. In contrast to the real rice genome (Figure S3), the longest exact
repeat in a random genome is 30 bp.
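
The arithmetic behind this caption can be checked with a short back-of-the-envelope calculation. The sketch below assumes independent, uniformly distributed bases, counts forward-strand matches only, and treats all position pairs as independent; the function names are illustrative and are not part of the simulation described here.

```python
import math


def prob_specific_match(k):
    """Probability that two given positions agree on k consecutive bases."""
    return 0.25 ** k


def expected_repeat_pairs(genome_size, k):
    """Approximate expected number of position pairs sharing an exact k-mer."""
    pairs = genome_size * (genome_size - 1) / 2
    return pairs * prob_specific_match(k)


def approx_longest_repeat(genome_size):
    """Length k at which the expected number of matching pairs drops to one."""
    pairs = genome_size * (genome_size - 1) / 2
    return math.log(pairs, 4)
```

For a genome of 370,792,118 bp this rough estimate gives a longest expected repeat of about 28 bp, in line with the roughly 30 bp observed in the simulated random genome, while the expected number of 100 bp repeats is effectively zero.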
Figure 5: Repeat effect on genome assembly in A. thaliana and D. melanogaster

The assembly performance of A. thaliana and D. melanogaster using 10x coverage shows an interesting reversal
between short and long reads. With relatively short reads (mean1), A. thaliana has superior assembly
performance to D. melanogaster. However, at longer read lengths the relative performance reverses. These
trends are explained by the complex genome structures, where D. melanogaster has more short repeats than
A. thaliana, but A. thaliana has more long repeats.
Figure 6: The correlation between genome size and longest repeat. The relationship between genome
size and longest repeat is plotted. There is a clear linear trend on this log-log plot (blue), while a few
outliers are above the trend (green) and a few are below the trend (red). We speculate that the outliers
below the trend, with unexpectedly short longest repeats, are due in part to poor reference genome quality.
Figure 7: Predictive power of SVR. The predictive power of SVR with different kernels is compared
to traditional regression methods (panel title: "Performance Comparison of SVR and baseline Machine
Learning Algorithm"). The methods compared are lasso, ridge, SVR with polynomial kernels of degree 1 to 4,
and SVR with an RBF kernel. Mean residual and mean squared error (MSE) are evaluated with
cross-validation. Under both criteria, SVR with the RBF kernel performed best.
Figure 8: Effect of read simulation randomness on genome assembly performance. Since reads are
generated with randomly assigned positions, different runs of the simulation may have different assembly
performances. For example, in one simulation a particular repeat may be spanned by sufficiently long reads
to assemble unambiguously, but not in another simulation. To measure how significant such variation is,
we simulated 5 different sets of sequencing reads for C. elegans. While there are some minor fluctuations in
performance, the overall performance is consistent between runs, suggesting that the dominant factors are
overall coverage and read length distribution.
This is the Genome Assembly Performance Prediction Service. If you have any queries, please email
Hayan Lee (hlee@cshl.edu).

Although assembly performance is a function of genome size, read length, coverage and repeats, this
prediction model uses only three features for simplicity: genome size, read length and coverage.

Given the genome size, we set the read lengths and coverages for you internally. With these three
features, our model predicts the expected assembly performance, which is defined as follows:

Performance(%) = N50 of assembly / N50 of chromosomes
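
The performance measure defined above can be computed as in the following minimal sketch. It assumes the usual definition of N50 (the largest length such that sequences at least that long contain at least half of all bases); the function names are illustrative and do not come from the web service.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one positive value")


def performance_percent(contig_lengths, chromosome_lengths):
    """Assembly N50 as a percentage of the N50 of the chromosomes."""
    return 100.0 * n50(contig_lengths) / n50(chromosome_lengths)
```

As a consistency check against the simulation results reported for the human genome, an assembly N50 of 50,810,670 bp against a denominator of 90,504,854 bp gives 56.14%.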
Figure 9: Genome assembly performance prediction web service. To serve the assembly
community effectively, we launched a web service, "Genome assembly performance prediction"
(http://qb.cshl.edu/asm_model/predict.html). The only input required is the genome size; read lengths
and coverages are set internally. Repeats are involved implicitly because our model accounts for them.
Once you hit the "submit" button, the service shows two graphs that present the same information from
two points of view: one in terms of read length, the other in terms of coverage.
1 Supplementary Note 1: Assembly Statistics Table Description



Information regarding specific assembly parameters can be found in Assembly_Results_and_Parameters.zip.
For each correction method, the archive lists the parameters used in each step of the respective pipeline
as well as the final assembly statistics.



2 Supplementary Note 2: ECTools and Downsampling Analysis

Recent advances in leading long-read sequencing technology have greatly expanded the scope of genome
assembly projects. Large genomes with prohibitively long and complex repeat structures are now becoming
tractable with current sequencing techniques. The tradeoff for obtaining these long reads, however, is a
substantially increased per-base error rate. Currently, no assembly software is equipped to handle reads
with error rates as high as 15%, which is the norm for long-read technologies such as SMRT sequencing
from Pacific Biosciences. It is instead necessary to perform a pre-assembly correction step that attempts to
reduce the error rate of the sequencing reads to a level (generally <3%) suitable for current assembly software.

Correction techniques can be broken into two categories, hybrid and non-hybrid. Hybrid correction
techniques require a separate library from the same sample to be sequenced using a technology that has a
lower error rate (generally Illumina). These high-identity reads are then mapped to the longer reads and a
consensus algorithm is used to produce a "corrected" long read. One of the best-known algorithms is
pacbioToCA, which is distributed with the Celera Assembler. Non-hybrid techniques use a similar approach,
but instead of using a second high-identity library, they require very deep long-read coverage to build the
consensus. HGAP is currently the leading software that implements this algorithm for correction. Both
algorithms rely on mapping reads to each other and then building a consensus. This approach has been very
successful in small, relatively simple genomes such as microbes. Even eukaryotes such as yeast can be
assembled into near-finished drafts using only shotgun long-read sequencing (Table S7).
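
The "map, then build a consensus" idea shared by both families of methods can be illustrated with a toy majority vote over reads that have already been aligned to the same coordinates. This is only a sketch under strong assumptions (a precomputed gapped alignment, one character per column, "-" for a gap and a space for no coverage); pacbioToCA and HGAP use considerably more elaborate alignment and consensus procedures.

```python
from collections import Counter


def column_consensus(rows):
    """Majority-vote consensus over equal-length aligned rows.

    Each row is a string in which '-' marks a gap and ' ' marks a column
    that the read does not cover. Columns with no coverage yield 'N';
    columns where a gap wins the vote contribute no base.
    """
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    consensus = []
    for col in range(width):
        votes = Counter(row[col] for row in rows if row[col] != " ")
        if not votes:
            consensus.append("N")
            continue
        base, _ = votes.most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)
```

For example, `column_consensus(["ACGTTGCA", "ACGATGCA", "ACGTTG-A", " CGTTGCA"])` outvotes the isolated substitution and deletion and returns the sequence supported by the majority of rows.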

However, error correction is an imperfect operation. pacbioToCA has difficulty correcting reads with a very
high error rate (Figure 4). This is mainly a limitation of the overlapper, which only supports error rates up
to 20%. In general, because all techniques require an initial mapping operation, repeats make the exact
placement of reads difficult. pacbioToCA has further issues in repeat regions because the overlapper only
uses k-mers that occur below a predefined repeat frequency. In many cases, alignments will not even be
seeded in repeat regions. Because HGAP uses long reads for self-correction, it is easier to find unique
alignments, which increases accuracy in repeat regions. HGAP's outperformance of hybrid techniques speaks
to the advantage long reads have over short reads in mapping specificity. Self-correction of long reads
clearly gives the best overall performance; however, the coverage required to run the correction
successfully can be quite expensive in genomes larger than a few megabases. In many larger genomes, where
the goal today is not to produce a finished genome but a usable draft at a reasonable cost, a
hybrid-correction approach may be more valuable.

In an effort to improve upon existing hybrid approaches and use what has been learned from long-read
self-correction, we developed ECTools. ECTools takes an existing short-read contig assembly and maps raw
long reads onto the assembly. Bases that do not agree with the high-identity short-read assembly are
assumed to be errors and are corrected in the long reads. In regions where the long reads span multiple
short-read contigs, the longest increasing subsequence algorithm is used (as implemented in the MUMmer
package) to find the best set of short-read unitigs that span the long read. In essence, the long reads are
being used as a backbone to build a layout of the short-read unitigs, and the resulting layout is used
to correct the long read. Because the short-read unitigs are longer than the short reads themselves, they
are generally easier to align to the error-prone long reads. This approach allows regions of high error to
be spanned and corrected, whereas previous approaches such as pacbioToCA would have discarded them.
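
The unitig selection step can be pictured as a weighted chaining problem along the long read. The sketch below is a simplified stand-in, not the MUMmer or ECTools implementation: it assumes each candidate alignment is summarized by its interval on the long read and a score (for instance, aligned length), and it picks the highest-scoring set of non-overlapping alignments in increasing read order with a quadratic dynamic program in the spirit of a weighted longest increasing subsequence. The `Alignment` type and all field names are hypothetical.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    read_start: int   # start on the long read (inclusive)
    read_end: int     # end on the long read (exclusive)
    unitig: str       # identifier of the short-read unitig
    score: float      # e.g. aligned length weighted by identity


def best_chain(alignments: List[Alignment]) -> List[Alignment]:
    """Highest-scoring chain of non-overlapping alignments along the read."""
    alns = sorted(alignments, key=lambda a: (a.read_start, a.read_end))
    n = len(alns)
    if n == 0:
        return []
    best = [a.score for a in alns]   # best chain score ending at alignment i
    prev = [-1] * n                  # back-pointer to the previous chain member
    for i in range(n):
        for j in range(i):
            compatible = alns[j].read_end <= alns[i].read_start
            if compatible and best[j] + alns[i].score > best[i]:
                best[i] = best[j] + alns[i].score
                prev[i] = j
    # Trace back from the best-scoring chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(alns[i])
        i = prev[i]
    return chain[::-1]
```

The resulting chain is the layout of unitigs along the long read; in a real pipeline, the bases of those unitigs would then replace the corresponding stretches of the read.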

The approach works well according to our assembly statistics (Table S7). In small genomes such as yeast,
results comparable to pacbioToCA can be seen. In addition to spanning small regions of uncorrected bases,
ECTools has a more permissive trimming algorithm than pacbioToCA. Rather than splitting long reads at
small uncorrected regions, ECTools allows the user to control the resulting read's final percent identity. The
user can lower the identity if she prefers to have reads that are longer, but with more uncorrected bases. Or,
the user can increase the identity if she prefers generally shorter reads that have higher identity. In small
genomes with abundant coverage, it is generally more appropriate to correct reads as perfectly as possible.
However, in larger genomes, it may be more important to retain read continuity rather than per-base
identity.

Assembled strain | Reference genome used for comparison
E. coli (K12) | E. coli MG1655
S. cerevisiae (W303) | S. cerevisiae S288C R64-1-1_20110203
A. thaliana (Ler-0) | A. thaliana (Col-0) TAIR10
O. sativa (IR64) | O. sativa ASM465v1.16

Table S7: Assembled strain

Figure 10: E. coli K12 raw read identity (all SMRT cells). Panel title: "E. Coli Uncorrected (All Libraries)"; x-axis: Read Length (bp).

To assess the performance of these different methods, we first looked at the small 4.6 Mbp E. coli K12 genome.
A total of 614x coverage of raw PacBio data was obtained from ?, with a total of 69x coverage of reads greater
than 10 kb. Correction of all of the data using HGAP and subsequent assembly produced a perfect assembly
with a single chromosome at 99.99% identity. To highlight the power and cost effectiveness of hybrid
correction approaches, we obtained 2x300 bp MiSeq data from Illumina's public BaseSpace repository, which had
over 2914x coverage of the E. coli genome. We selected SMRT cell SRR801650, which had 47x coverage and
6x coverage of reads greater than 10 kb, and corrected it with both pacbioToCA and ECTools. Both
produce perfect single-contig assemblies at 99.98% identity with respect to the reference.
Figure 11: E. coli K12 HGAP read identity (all SMRT cells)

Figure 12: E. coli K12 raw read identity (SRR801650)

Figure 13: E. coli K12 read identity after correction with ECTools (SRR801650)

Figure 14: E. coli K12 read identity after correction with pacbioToCA (SRR801650)

Figure 15: E. coli K12 read length histogram comparing uncorrected reads, reads corrected with pacbioToCA,
and reads corrected with ECTools (SRR801650)

Looking at the correction accuracy figures, we see that all correction approaches
correct the majority of the reads very well, which is not a surprise given that the genome assembles into a
single perfect contig.

Next, we assembled the larger S. cerevisiae W303, which is 12 Mbp and has 16 chromosomes. The genome
was sequenced to very deep coverage (237x) using the Pacific Biosciences RS II SMRT sequencer. With this
very deep coverage, self-correction is the preferred method for the reasons stated previously, and so HGAP
was used to correct reads that were greater than 10 kb. W303 was assembled nearly perfectly, with all but a
single chromosome represented in a single contig. The one chromosome break was due to a 35 kb repeat
cluster, too long to be spanned with enough coverage in our dataset.

The success seen in both E. coli and S. cerevisiae highlights the power of the current sequencing technology.
However, sequencing S. cerevisiae to such depth required a total of 16 SMRT cells. Although this is acceptable
if the goal is to produce a high-quality finished assembly, for applications that only require a reasonably
good but not perfect assembly, it may be more appropriate to produce fewer long reads and use a hybrid
method to correct them. This could be a good approach if, for example, the goal is to sequence many
individuals rather than a pooled population. Using the HGAP assembly as a representative upper bound, we
chose a single SMRT cell from the S. cerevisiae dataset and ran hybrid corrections (pacbioToCA and ECTools)
for comparison. Compared to HGAP by contig N50, the W303 hybrid assemblies (Table S7) lag quite
significantly. ECTools produced an N50 contig size of roughly 413kb and the pacbioToCA assembly had an N50
of about 361kb. Both are only about half of HGAP's 810kb N50 contig size. However, the ECTools N50 count
is 9, while HGAP's N50 count is only 6, which means that in the best half of the ECTools assembly there
are only a few extra breakpoints. In these small genomes the N50 count becomes an important comparative
statistic because the assemblies are approaching perfection and a single breakpoint can mean a large change
in N50 bases. The conclusion from this experiment is that, even in eukaryotic genomes, one can produce a
very good assembly that approaches the upper limit (HGAP) using a fraction of the long-read data and a
hybrid correction method.
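
To make the two statistics concrete, the following is a minimal sketch of how contig N50 and N50 count
could be computed from a list of contig lengths. The function name n50_stats and the toy lengths are
assumptions introduced here for illustration; they are not taken from the assemblies discussed above.

```python
from typing import Iterable, Tuple


def n50_stats(contig_lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, N50 count) for a collection of contig lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total
    assembly size. N50 count is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable for non-negative lengths")


# Hypothetical toy assembly (arbitrary units), not real data.
intact = [500, 400, 300, 200, 100]
# Same assembly with the longest contig broken at one point.
broken = [250, 250, 400, 300, 200, 100]

print(n50_stats(intact))
print(n50_stats(broken))
```

In this toy case a single extra breakpoint lowers the N50 from 400 to 250 while the N50 count only moves
from 2 to 3, which illustrates why the count can be the steadier statistic when assemblies are nearly
complete.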

In yeast, ECTools and pacbioToCA were quite comparable. The read correction graphs (Figures 9,10,11)
show that both perform quite well, with only a small fraction of the reads below 99% identity. The length
distributions (Figure 12) also look comparable except at the high end, where ECTools generally produces
more long reads than pacbioToCA. This reflects ECTools' more conservative splitting algorithm, which is
designed to preserve read continuity.

The preservation of read continuity helps ECTools outperform pacbioToCA in yeast, but it becomes more
important in larger genome assemblies such as Arabidopsis. Pacific Biosciences has released public
Arabidopsis long-read data for the Ler-0 strain. The project produced over 118x coverage, with 38x in
reads over 10kb, using 93 SMRT cells. With this very deep coverage, a non-hybrid approach is the preferred
method for correction. Assembling this data after HGAP correction of reads longer than 10kb resulted in an
assembly with a contig N50 of 8.4Mb; some chromosome arms are in single contigs.
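
As a small worked illustration of the coverage figures quoted above, the sketch below computes total
coverage and the coverage contributed by reads at or above a length cutoff. The function name
fold_coverage, the inclusive cutoff, and the toy read lengths and genome size are assumptions made for
this example only.

```python
from typing import Iterable


def fold_coverage(read_lengths: Iterable[int], genome_size: int,
                  min_length: int = 0) -> float:
    """Fold coverage from reads of at least `min_length` bases.

    Coverage is the total number of sequenced bases divided by the
    genome size; the cutoff is inclusive here, which is an assumption.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    kept_bases = sum(n for n in read_lengths if n >= min_length)
    return kept_bases / genome_size


# Hypothetical toy values, not the Ler-0 dataset.
reads = [12_000, 8_000, 15_000, 5_000]
genome = 10_000

total_cov = fold_coverage(reads, genome)
long_cov = fold_coverage(reads, genome, min_length=10_000)
print(f"total: {total_cov:.1f}x, reads >= 10kb: {long_cov:.1f}x")
```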

Although this assembly is very good, sequencing 93 SMRT cells is quite expensive today. Depending on
the budget and the desired assembly quality, it may be more cost-efficient to attempt a hybrid correction
method. As with yeast, we downsampled the Arabidopsis data and tested each approach at various coverage
levels (Figure 17). At coverage levels lower than 40x, HGAP's self-correction lags behind the hybrid
approaches quite significantly. This is likely due to the direct relationship between error rate and the
coverage required for consensus: as the error rate increases, more per-base coverage is required to
discriminate between true bases and errors. Setting HGAP aside and focusing only on the hybrid approaches,
ECTools edges out pacbioToCA at all coverage levels, and its N50 advantage grows at higher coverages.
Looking at the read length histograms (Figure 13), ECTools is again more conservative in its splitting,
and fewer long reads are discarded than with pacbioToCA. The same effect can be seen in the identity
histograms, where ECTools produces a larger number of long corrected reads (Figures 14,15,16). This
experiment gave excellent insight into the comparative assembly performance of hybrid and non-hybrid
approaches.
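
The relationship between error rate and required coverage can be illustrated with a deliberately
simplified model: assume each read is independently correct at a given position with probability
1 - e, and that the consensus is right whenever a strict majority of reads is correct. This is a sketch
under those assumptions, not a description of how HGAP computes consensus; the function name
majority_correct_prob and the example error rates are hypothetical.

```python
from math import comb


def majority_correct_prob(coverage: int, error_rate: float) -> float:
    """Probability that a strict majority of reads is correct at one base.

    Assumes independent errors and a binary correct/incorrect outcome per
    read, which is a simplification of real consensus calling.
    """
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    p_correct = 1.0 - error_rate
    needed = coverage // 2 + 1  # smallest strict majority
    return sum(
        comb(coverage, k) * p_correct**k * error_rate**(coverage - k)
        for k in range(needed, coverage + 1)
    )


# Hypothetical error rates: a noisier and a cleaner read set.
for e in (0.30, 0.15):
    for cov in (3, 11, 21):
        print(f"error={e:.2f} coverage={cov:>2}x "
              f"P(majority correct)={majority_correct_prob(cov, e):.4f}")
```

Under this model, a higher per-read error rate needs more reads per base to reach the same consensus
confidence, which is consistent with self-correction lagging at low coverage.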

Figure 17: Yeast strain W303 read identity after correction with ECTools

Figure 18: Yeast strain W303 read identity after correction with pacbioToCA

Figure 19: Yeast strain W303 read length histogram comparing uncorrected reads, reads corrected with
pacbioToCA and reads corrected with ECTools.

Figure 20: Arabidopsis read length histogram comparing uncorrected reads, reads corrected with
pacbioToCA and reads corrected with ECTools.

With the success in Arabidopsis, we attempted to apply these approaches to even larger genomes. A sample
of the rice strain IR64 was sequenced to roughly 16x total coverage using the Pacific Biosciences RS II.
Because this genome is quite large compared to those previously discussed, producing enough sequence to
run HGAP is prohibitively expensive at this time; the only feasible option is a hybrid approach. A MiSeq
2x300bp run with a 450bp insert was used as the high-identity library for error correction.

Examining the post-correction sequence identity histograms, it is clear that ECTools copes better than
pacbioToCA with the increased complexity of the rice genome (Figures 18,19,20). In contrast to the
previously discussed organisms, rice has substantially more repeats. For reasons mentioned above,
ECTools was developed to improve results in some of these repeat regions where pacbioToCA has difficulty.
The final results show that ECTools is able to produce an assembly with an N50 contig length of 272kb,
while the pacbioToCA assembly has an N50 of only 143kb. Both are a substantial improvement over short-read
assemblies such as the MiSeq-only assembly, which had an N50 contig size of only 19kb.

In conclusion, the previous discussion presented results from the new error correction routine ECTools. It
was first shown that ECTools can perform on par with other hybrid error correction techniques when
correcting small genomes. In larger genomes, ECTools had a clear advantage and was shown to produce
substantially more contiguous assemblies than other hybrid algorithms such as pacbioToCA.
Although, as discussed, hybrid approaches generally lag behind long-read self-correction approaches, for

Figure 21: Arabidopsis Ler-0 raw read identity

Figure 22: Arabidopsis Ler-0 read identity after correction with ECTools

Figure 23: Arabidopsis Ler-0 read identity after correction with pacbioToCA



Arabidopsis downsample: HGAP

Coverage   Max Contig Size   N50           N50 cnt
10         316393            N/A           N/A
20         263740            [illegible]   [illegible]
30         237063            [illegible]   [illegible]
40         349266            [illegible]   [illegible]
80         5235677           [illegible]   [illegible]
120        [illegible]       [illegible]   [illegible]

Arabidopsis downsample: ECTools

Coverage   Max Contig Size   N50           N50 cnt
10         316400            [illegible]   [illegible]
20         385206            [illegible]   [illegible]
30         1588613           [illegible]   [illegible]
40         3841500           [illegible]   [illegible]
80         5370883           [illegible]   [illegible]
120        4957746           [illegible]   [illegible]

Arabidopsis downsample: pacbioToCA

Coverage   Max Contig Size   N50           N50 cnt
10         277983            10771         3017
20         347887            44420         756
30         1488711           212193        150
40         1618669           365151        90
80         4191848           691660        41
120        6202231           845345        33

Figure 24: Arabidopsis downsampling experiment 

Figure 26: Rice strain IR64 read identity after correction with ECTools

Figure 27: Rice strain IR64 read identity after correction with pacbioToCA

certain applications a hybrid error-correction may be a better choice. ECTools' intended use is for hybrid
projects in genomes larger than a few megabases.